SQL Server 2008: Failover Clustering

10/19/2010 3:40:53 PM

Failover clustering is a technique that uses a cluster of SQL Server instances to protect against failure of the instance currently serving your users. Failover clustering is based on a hardware solution composed of multiple servers (known as nodes) that share the same disk resources. One server is active and owns the database. If that server fails, another server in the cluster takes over ownership of the database and continues to serve users.

1. Key Terms

When discussing high availability, each technique has its own set of key terms. At the beginning of each section, we will list the terms used for each solution. Here are some of the terms you need to be familiar with when setting up a failover cluster:

Node: Server that participates in the failover cluster.
Resource group: Shared set of disks or network resources grouped together to act as a single working unit.
Active node: Node that has ownership of a resource group.
Passive node: Node that is waiting on the active node to fail in order to take ownership of a resource group.
Heartbeat: Health checks sent between nodes to ensure the availability of each node.
Public network: Network used to access the failover cluster from a client computer.
Private network: Network used to send heartbeat messages between nodes.
Quorum: A special resource group that holds information about the nodes, including the name and state of each node.

2. Failover Clustering Overview

You can use failover clustering to protect an entire instance of SQL Server. Although the nodes share the same disks or resources, only one server may have ownership (read and write privileges) of the resource group at any given time. If a failover occurs, the ownership is transferred to another node, and SQL Server is back up in the time it takes to bring the databases back online. The failover usually takes anywhere from a few seconds to a few minutes, depending on the size of the database and types of transactions that may have been open during the failure.

In order for the database to return to a usable state during a failover, it must go through a Redo phase to roll forward logged transactions and an Undo phase to roll back any uncommitted transactions. Fast recovery, an Enterprise Edition feature introduced in SQL Server 2005, allows applications to access the database as soon as the Redo phase has completed. Also, since the cluster appears on the network as a single server, there is no need to redirect applications to a different server during a failover. The network abstraction combined with fast recovery makes failing over a fairly quick and unobtrusive process that ultimately results in less downtime during a failure.
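The two recovery phases can be illustrated with a toy model. The sketch below is not how SQL Server implements recovery; the LogRecord type, the recover function, and the sample data are all hypothetical and exist only to show the order of operations: every logged change is rolled forward first, and then the changes made by uncommitted transactions are rolled back, newest first.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One change in a toy transaction log (hypothetical structure)."""
    txn: str   # transaction identifier
    key: str   # item that was modified
    old: int   # value before the change
    new: int   # value after the change


def recover(data: dict, log: list, committed: set) -> dict:
    """Return the state after a toy Redo pass followed by a toy Undo pass."""
    state = dict(data)
    # Redo: roll forward every logged change, in log order.
    for rec in log:
        state[rec.key] = rec.new
    # Undo: roll back changes made by uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec.txn not in committed:
            state[rec.key] = rec.old
    return state


# T1 committed before the failure; T2 was still open.
log = [
    LogRecord("T1", "a", 10, 11),
    LogRecord("T2", "b", 20, 25),
    LogRecord("T1", "a", 11, 12),
]
recovered = recover({"a": 10, "b": 20}, log, committed={"T1"})
print(recovered)  # {'a': 12, 'b': 20}
```

In this model, fast recovery corresponds to letting readers in once the first loop has finished, while the second loop is still running.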

The number of nodes that can be added to the cluster depends on the edition of the operating system (OS) as well as the edition of SQL Server, with a maximum of 16 nodes using SQL Server 2008 Enterprise Edition running on Windows Server 2008. Failover clustering is only supported in the Enterprise and Standard Editions of SQL Server. If you are using the Standard Edition, you are limited to a 2-node cluster. Since failover clustering is also dependent on the OS, you should be aware of the limitations for each edition of the OS as well. Windows Server 2008 only supports the use of failover clustering in its Enterprise, Datacenter, and Itanium Editions. Windows Server 2008 Enterprise and Datacenter Editions both support a 16-node cluster, while the Itanium edition only supports an 8-node cluster.
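Because the limit comes from two places, the effective maximum is the smaller of the two. The following sketch simply encodes the figures quoted above as lookup tables; the function name max_cluster_nodes and the table names are hypothetical, and the numbers apply only to SQL Server 2008 on Windows Server 2008.

```python
# Node limits quoted in the text (SQL Server 2008 on Windows Server 2008).
SQL_EDITION_MAX_NODES = {"Enterprise": 16, "Standard": 2}
OS_EDITION_MAX_NODES = {"Enterprise": 16, "Datacenter": 16, "Itanium": 8}


def max_cluster_nodes(sql_edition: str, os_edition: str) -> int:
    """Return the largest cluster that both editions allow."""
    if sql_edition not in SQL_EDITION_MAX_NODES:
        raise ValueError(f"SQL Server {sql_edition} is not listed as supporting failover clustering")
    if os_edition not in OS_EDITION_MAX_NODES:
        raise ValueError(f"Windows Server 2008 {os_edition} is not listed as supporting failover clustering")
    # The stricter of the two limits wins.
    return min(SQL_EDITION_MAX_NODES[sql_edition], OS_EDITION_MAX_NODES[os_edition])


print(max_cluster_nodes("Enterprise", "Datacenter"))  # 16
print(max_cluster_nodes("Enterprise", "Itanium"))     # 8
print(max_cluster_nodes("Standard", "Enterprise"))    # 2
```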

In order to spread the workload of multiple databases across servers, every node in a cluster can have ownership of its own set of resources and its own instance of SQL Server. Every server that owns a resource is referred to as an active node, and every server that does not own a resource is referred to as a passive node. There are two basic types of cluster configurations: a single-node cluster and a multi-node cluster. A multi-node cluster contains two or more active nodes, and a single-node cluster contains one active node with one or more passive nodes. Figure 1 shows a standard single-node cluster configuration commonly referred to as an active/passive configuration.

Figure 1. Common single-node (active/passive) cluster configuration

So if multiple active nodes allow you to utilize all of your servers, why wouldn't you make all the servers in the cluster active? The answer: resource constraints. In a 2-node cluster, if one of the nodes has a failure, the other available node will have to process its normal load as well as the load of the failed node. For this reason, it is considered best practice to have one passive node per active node in a failover cluster. Figure 2 shows a healthy multi-node cluster configuration running with only two nodes. If Node 1 has a failure, as demonstrated in Figure 3, Node 2 is now responsible for both instances of SQL Server. While this configuration will technically work, if Node 2 does not have the capacity to handle the workload of both instances, your server may slow to a crawl, ultimately leading to unhappy users in two systems instead of just one.

Figure 2. Multi-node (active/active) cluster configuration

Figure 3. Multi-node (active/active) cluster configuration after failure
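The capacity concern behind Figures 2 and 3 comes down to simple arithmetic. The sketch below is a rough planning aid rather than a sizing tool: the can_absorb function is hypothetical, and the load and capacity figures are made-up percentages of a single server's capacity.

```python
def can_absorb(loads: dict, capacity: dict, failed: str, target: str) -> bool:
    """Return True if `target` can carry its own load plus the failed node's load."""
    return loads[target] + loads[failed] <= capacity[target]


capacity = {"Node1": 100, "Node2": 100}

# Active/active: both nodes are busy, so a failover overloads the survivor.
active_active = {"Node1": 55, "Node2": 60}
print(can_absorb(active_active, capacity, failed="Node1", target="Node2"))  # False

# Active/passive: the passive node is idle and has room for the full load.
active_passive = {"Node1": 55, "Node2": 0}
print(can_absorb(active_passive, capacity, failed="Node1", target="Node2"))  # True
```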

So, how does all of this work? A heartbeat signal is sent between the nodes to determine the availability of one another. If one of the nodes has not received a message within a given time period or number of retries, a failover is initiated and the primary failover node takes ownership of the resources. It is the responsibility of the quorum drive to maintain a record of the state of each node during this process.

Heartbeat checks are performed at the OS level as well as the SQL Server level. The OS is in constant contact with the other nodes, checking the health and availability of the servers. For this reason, a private network is used for the heartbeat between nodes to decrease the possibility of a failover occurring due to network-related issues. At the SQL Server level, two checks known as LooksAlive and IsAlive are used. LooksAlive is a less intrusive check that runs every 5 seconds to make sure the SQL Server service is running. The IsAlive check runs every 60 seconds and executes the query SELECT @@SERVERNAME against the active node to make sure that SQL Server can respond to incoming requests.
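The timing described above can be sketched in a few lines. This is a simplified model, not the cluster service's actual scheduler: the 5-second and 60-second intervals come from the text, while the function names, the heartbeat interval, and the missed-beat threshold are assumptions chosen purely for illustration.

```python
LOOKS_ALIVE_INTERVAL = 5   # seconds, from the text
IS_ALIVE_INTERVAL = 60     # seconds, from the text


def checks_due(elapsed_seconds: int) -> list:
    """Return the SQL Server checks that fire at a given whole second."""
    due = []
    if elapsed_seconds % LOOKS_ALIVE_INTERVAL == 0:
        due.append("LooksAlive")
    if elapsed_seconds % IS_ALIVE_INTERVAL == 0:
        due.append("IsAlive")
    return due


def should_fail_over(last_heartbeat: float, now: float,
                     interval: float, max_missed: int) -> bool:
    """Return True once at least `max_missed` whole heartbeat intervals
    have elapsed since the last heartbeat was received."""
    missed = int((now - last_heartbeat) // interval)
    return missed >= max_missed


print(checks_due(5))    # ['LooksAlive']
print(checks_due(60))   # ['LooksAlive', 'IsAlive']

# Hypothetical settings: one heartbeat per second, fail over after 5 missed beats.
print(should_fail_over(last_heartbeat=0.0, now=4.9, interval=1.0, max_missed=5))  # False
print(should_fail_over(last_heartbeat=0.0, now=5.0, interval=1.0, max_missed=5))  # True
```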

3. Implementation

Before you can even install SQL Server, you have to make sure that you have configured a solid Windows cluster at the OS and hardware levels. One of the major pain points you used to have when setting up a failover cluster was searching the Hardware Compatibility List (HCL) to ensure that the implemented hardware solution would be supported. This requirement has been removed in Windows Server 2008. You can now run the new cluster validation tool to perform all the required checks that will ensure you are running on a supported configuration. Not only can the cluster validation tool be used to confirm the server configuration, it can be used to troubleshoot issues after setup as well.

From a SQL Server perspective, the installation process is pretty straightforward. If you are familiar with installing a failover cluster in SQL Server 2005, be aware that the process has completely changed in SQL Server 2008. First you run the installation on one node to create a single-node cluster. Then you run the installation on each remaining node, choosing the Add Node option during setup. In order to remove a node from a cluster, you run setup.exe on the server that needs to be removed and select the Remove Node option. The new installation process allows for more granular manageability of the nodes, allowing you to add and remove nodes without bringing down the entire cluster. The new installation process also allows you to perform patching and rolling upgrades with minimal downtime.

Do not be afraid of failover clustering. It takes a little extra planning up front, but once it has been successfully implemented, managing a clustered environment is really not much different from managing any other SQL environment. Just as the SQL instance has been presented to the application in a virtual network layer, it will be presented to the administrator this way as well. As long as you use all the correct virtual names when connecting to the SQL instance, it will just be administration as usual.

4. Pros and Cons of Failover Clustering

As with any other technology solution, failover clustering brings certain benefits, but at a cost. Benefits of failover clustering include the following: